%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
Data Preprocessing / Data Cleaning¶
Remove duplicates, verify the data types, and check for missing values, i.e. prepare the data for analysis.
os.listdir(r"D:\Datasets")
['other-American_B01362.csv', 'other-Carmel_B00256.csv', 'other-Dial7_B00887.csv', 'other-Diplo_B01196.csv', 'other-Federal_02216.csv', 'other-FHV-services_jan-aug-2015.csv', 'other-Firstclass_B01536.csv', 'other-Highclass_B01717.csv', 'other-Lyft_B02510.csv', 'other-Prestige_B01338.csv', 'other-Skyline_B00111.csv', 'Uber-Jan-Feb-FOIL.csv', 'uber-raw-data-apr14.csv', 'uber-raw-data-aug14.csv', 'uber-raw-data-janjune-15.csv', 'uber-raw-data-janjune-15_sample.csv', 'uber-raw-data-jul14.csv', 'uber-raw-data-jun14.csv', 'uber-raw-data-may14.csv', 'uber-raw-data-sep14.csv']
uber_15 = pd.read_csv(r"D:\Datasets\uber-raw-data-janjune-15_sample.csv")
uber_15.shape
(100000, 4)
type(uber_15)
pandas.core.frame.DataFrame
uber_15.duplicated()
0 False
1 False
2 False
3 False
4 False
...
99995 False
99996 False
99997 False
99998 False
99999 False
Length: 100000, dtype: bool
uber_15.shape
(100000, 4)
uber_15.drop_duplicates(inplace = True)
uber_15.duplicated().sum()
np.int64(0)
uber_15.shape
(99946, 4)
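The cells above count duplicates only after dropping them, so the count shown is 0; the 54 removed rows are visible only as the change in shape (100000 to 99946). A minimal sketch of counting first and dropping second, using a small hypothetical frame rather than the real file:

```python
import pandas as pd

# Hypothetical toy data standing in for the pickup records.
trips = pd.DataFrame({
    "Dispatching_base_num": ["B02617", "B02682", "B02617", "B02764"],
    "Pickup_date": ["2015-05-02 21:43:00", "2015-01-20 19:52:59",
                    "2015-05-02 21:43:00", "2015-04-10 17:38:00"],
    "locationID": [237, 231, 237, 107],
})

rows_before = len(trips)
# duplicated() flags rows that repeat an earlier row in every column.
n_duplicates = int(trips.duplicated().sum())
# drop_duplicates() returns a new frame; the original is left untouched.
deduped = trips.drop_duplicates()
rows_after = len(deduped)

print(rows_before, n_duplicates, rows_after)
```

Counting before dropping gives a number that can be checked against the difference in row counts.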
uber_15.dtypes
Dispatching_base_num    object
Pickup_date             object
Affiliated_base_num     object
locationID               int64
dtype: object
uber_15.isnull().sum()
Dispatching_base_num       0
Pickup_date                0
Affiliated_base_num     1116
locationID                 0
dtype: int64
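Raw counts of missing values are easier to judge next to the share of rows they represent. The sketch below builds a small report with both, on a hypothetical frame; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Dispatching_base_num": ["B02617", "B02682", "B02764", "B02598"],
    "Affiliated_base_num": ["B02764", np.nan, "B02764", np.nan],
})

missing_count = toy.isnull().sum()          # number of missing cells per column
missing_percent = toy.isnull().mean() * 100  # mean of booleans = fraction missing
report = pd.DataFrame({"missing": missing_count, "percent": missing_percent})

print(report)
```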
uber_15["Pickup_date"][0]
'2015-05-02 21:43:00'
type(uber_15["Pickup_date"][0])
str
uber_15["Pickup_date"] = pd.to_datetime(uber_15["Pickup_date"])
uber_15["Pickup_date"].dtype
dtype('<M8[ns]')
uber_15["Pickup_date"][0]
Timestamp('2015-05-02 21:43:00')
type(uber_15["Pickup_date"][0])
pandas._libs.tslibs.timestamps.Timestamp
uber_15.dtypes
Dispatching_base_num            object
Pickup_date             datetime64[ns]
Affiliated_base_num             object
locationID                       int64
dtype: object
To confirm in code that datetime64[ns] and <M8[ns] refer to the same dtype:
np.dtype('datetime64[ns]') == np.dtype('<M8[ns]')
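The conversion above lets pandas infer the timestamp layout. As a hedged alternative, the layout can be stated explicitly and unparseable strings turned into NaT instead of raising. The format string below is an assumption based on the sample value '2015-05-02 21:43:00', and the third entry is a deliberately bad value added for illustration.

```python
import pandas as pd

raw = pd.Series(["2015-05-02 21:43:00", "2015-01-20 19:52:59", "not a date"])

# format= pins the expected layout; errors="coerce" maps bad strings to NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(parsed)
print("unparsed values:", n_unparsed)
```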
Categorical data uses the object and bool dtypes; numerical data uses the integer and float dtypes.
Categorical data is data that can be sorted into groups, categories, or labels.
Examples of categorical variables are age group and blood type.
Numerical data is data expressed as numbers.
Examples of numerical data are height, weight, and age.
Numerical data has two categories: discrete data and continuous data.
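The split between categorical and numerical columns can be sketched with `select_dtypes`. The frame below is hypothetical: `is_weekend` and `trip_minutes` are invented columns, added only so that each dtype family is represented.

```python
import pandas as pd

# Hypothetical frame with one column of each kind discussed above.
toy = pd.DataFrame({
    "Dispatching_base_num": ["B02617", "B02682"],  # text labels
    "is_weekend": [True, False],                   # bool
    "locationID": [237, 231],                      # integer
    "trip_minutes": [12.5, 7.0],                   # float
})

# "number" covers integer and float dtypes; bool is not counted as a number.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

Note that `locationID` is stored as an integer but acts as a label, so the dtype alone does not always settle whether a column is categorical.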
Analyzing which month has the most Uber pickups¶
uber_15
| | Dispatching_base_num | Pickup_date | Affiliated_base_num | locationID |
|---|---|---|---|---|
| 0 | B02617 | 2015-05-02 21:43:00 | B02764 | 237 |
| 1 | B02682 | 2015-01-20 19:52:59 | B02682 | 231 |
| 2 | B02617 | 2015-03-19 20:26:00 | B02617 | 161 |
| 3 | B02764 | 2015-04-10 17:38:00 | B02764 | 107 |
| 4 | B02764 | 2015-03-23 07:03:00 | B00111 | 140 |
| ... | ... | ... | ... | ... |
| 99995 | B02764 | 2015-04-13 16:12:00 | B02764 | 234 |
| 99996 | B02764 | 2015-03-06 21:32:00 | B02764 | 24 |
| 99997 | B02598 | 2015-03-19 19:56:00 | B02598 | 17 |
| 99998 | B02682 | 2015-05-02 16:02:00 | B02682 | 68 |
| 99999 | B02764 | 2015-06-24 16:04:00 | B02764 | 125 |
99946 rows × 4 columns
uber_15['Pickup_date'].dt.month
0 5
1 1
2 3
3 4
4 3
..
99995 4
99996 3
99997 3
99998 5
99999 6
Name: Pickup_date, Length: 99946, dtype: int32
uber_15['month'] = uber_15['Pickup_date'].dt.month_name()
uber_15['month'].value_counts()
month
June        19620
May         18660
April       15982
March       15969
February    15896
January     13819
Name: count, dtype: int64
uber_15['month'].value_counts().plot()
<Axes: xlabel='month'>
uber_15['month'].value_counts().plot(kind = "bar")
<Axes: xlabel='month'>
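`value_counts()` sorts by count, so the plots above run from the busiest month to the quietest rather than in calendar order. A small sketch of reordering the counts chronologically before plotting, using the counts from the output above:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"June": 19620, "May": 18660, "April": 15982,
     "March": 15969, "February": 15896, "January": 13819},
    name="count",
)

month_order = ["January", "February", "March", "April", "May", "June"]
by_calendar = counts.reindex(month_order)  # same values, calendar order
busiest_month = counts.idxmax()            # label of the largest count

print(by_calendar)
print("busiest month:", busiest_month)
# by_calendar.plot(kind="bar") would draw the bars from January to June.
```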
Analyzing Uber pickups by weekday for each month using a pivot table¶
uber_15['weekday'] = uber_15['Pickup_date'].dt.day_name()
uber_15['hours'] = uber_15['Pickup_date'].dt.hour
uber_15['mins'] = uber_15['Pickup_date'].dt.minute
uber_15.head(6)
| | Dispatching_base_num | Pickup_date | Affiliated_base_num | locationID | month | weekday | hours | mins |
|---|---|---|---|---|---|---|---|---|
| 0 | B02617 | 2015-05-02 21:43:00 | B02764 | 237 | May | Saturday | 21 | 43 |
| 1 | B02682 | 2015-01-20 19:52:59 | B02682 | 231 | January | Tuesday | 19 | 52 |
| 2 | B02617 | 2015-03-19 20:26:00 | B02617 | 161 | March | Thursday | 20 | 26 |
| 3 | B02764 | 2015-04-10 17:38:00 | B02764 | 107 | April | Friday | 17 | 38 |
| 4 | B02764 | 2015-03-23 07:03:00 | B00111 | 140 | March | Monday | 7 | 3 |
| 5 | B02617 | 2015-05-03 19:42:00 | B02617 | 87 | May | Sunday | 19 | 42 |
## Creating the grouped bar chart from the pivot table
pivot = pd.crosstab(index= uber_15['month'], columns = uber_15['weekday'])
pivot
| month / weekday | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday |
|---|---|---|---|---|---|---|---|
| April | 2365 | 1833 | 2508 | 2052 | 2823 | 1880 | 2521 |
| February | 2655 | 1970 | 2550 | 2183 | 2396 | 2129 | 2013 |
| January | 2508 | 1353 | 2745 | 1651 | 2378 | 1444 | 1740 |
| June | 2793 | 2848 | 3037 | 2485 | 2767 | 3187 | 2503 |
| March | 2465 | 2115 | 2522 | 2379 | 2093 | 2388 | 2007 |
| May | 3262 | 1865 | 3519 | 2944 | 2627 | 2115 | 2328 |
pivot.plot(kind='bar', figsize=(8,6))
<Axes: xlabel='month'>
Busiest weekday in each month, read from the pivot table above:
- January → Saturday
- February → Friday
- March → Saturday
- April → Thursday
- May → Saturday
- June → Tuesday
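The busiest weekday per month can also be derived programmatically instead of being read off the chart. The sketch below rebuilds the pivot table from the values printed above and applies `idxmax` along each row.

```python
import pandas as pd

weekdays = ["Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"]

# Values copied from the crosstab output above.
pivot = pd.DataFrame(
    [[2365, 1833, 2508, 2052, 2823, 1880, 2521],
     [2655, 1970, 2550, 2183, 2396, 2129, 2013],
     [2508, 1353, 2745, 1651, 2378, 1444, 1740],
     [2793, 2848, 3037, 2485, 2767, 3187, 2503],
     [2465, 2115, 2522, 2379, 2093, 2388, 2007],
     [3262, 1865, 3519, 2944, 2627, 2115, 2328]],
    index=["April", "February", "January", "June", "March", "May"],
    columns=weekdays,
)

busiest_weekday = pivot.idxmax(axis=1)  # column label of each row's maximum
peak_pickups = pivot.max(axis=1)        # the maximum itself

print(pd.DataFrame({"weekday": busiest_weekday, "pickups": peak_pickups}))
```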
Analyzing the hourly rush in New York City for each weekday¶
summary = uber_15.groupby(['weekday', 'hours']).size().reset_index(name='size')
summary
| | weekday | hours | size |
|---|---|---|---|
| 0 | Friday | 0 | 581 |
| 1 | Friday | 1 | 333 |
| 2 | Friday | 2 | 197 |
| 3 | Friday | 3 | 138 |
| 4 | Friday | 4 | 161 |
| ... | ... | ... | ... |
| 163 | Wednesday | 19 | 1044 |
| 164 | Wednesday | 20 | 897 |
| 165 | Wednesday | 21 | 949 |
| 166 | Wednesday | 22 | 900 |
| 167 | Wednesday | 23 | 669 |
168 rows × 3 columns
plt.figure(figsize=(8,6))
sns.pointplot(x='hours', y='size',hue='weekday',data=summary)
plt.show()
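Alongside the point plot, the peak hour for each weekday can be pulled out of the summary table directly. The sketch below uses only the six rows visible in the truncated output above (Friday hours 0 to 2, Wednesday hours 19 to 21), so the peaks it reports are peaks within that excerpt, not for the full data.

```python
import pandas as pd

# Rows copied from the visible part of the summary output above.
summary = pd.DataFrame({
    "weekday": ["Friday"] * 3 + ["Wednesday"] * 3,
    "hours": [0, 1, 2, 19, 20, 21],
    "size": [581, 333, 197, 1044, 897, 949],
})

# idxmax() per group gives the row label of the largest count for each weekday.
peak_rows = summary.loc[summary.groupby("weekday")["size"].idxmax()]

# A wide layout (hours as rows, weekdays as columns) is handy for line plots.
hour_by_weekday = summary.pivot(index="hours", columns="weekday", values="size")

print(peak_rows)
```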
Analyzing the most active Uber base number¶
uber_15.columns
Index(['Dispatching_base_num', 'Pickup_date', 'Affiliated_base_num',
'locationID'],
dtype='object')
uber_foil = pd.read_csv(r"D:\Datasets\Uber-Jan-Feb-FOIL.csv")
uber_foil.shape
(354, 4)
uber_foil.head(2) # we can see we have active_vehicles column in this data
| | dispatching_base_number | date | active_vehicles | trips |
|---|---|---|---|---|
| 0 | B02512 | 1/1/2015 | 190 | 1132 |
| 1 | B02765 | 1/1/2015 | 225 | 1765 |
!pip install chart_studio
!pip install plotly
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected = True)
uber_foil.columns
Index(['dispatching_base_number', 'date', 'active_vehicles', 'trips'], dtype='object')
px.box(x = 'dispatching_base_number', y = 'active_vehicles', data_frame = uber_foil)
Highest active base: B02764
- Active vehicles: 1,619
- Significantly higher than the other bases, making it the most active base in this data.
Second highest active base: B02682
- Active vehicles: 600
Third highest active base: B02617
- Active vehicles: 596
Observation:
B02764 has nearly three times as many active vehicles as the next base (1,619 vs. 600), indicating it is the primary hub of activity.
B02682 and B02617 show comparable activity to each other but are much lower than B02764.
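The comparison above is read off the box plot. As a rough sketch, the same kind of ranking can be computed with a grouped aggregation. The frame below is hypothetical: only the base numbers and the first value for each base (190 and 225) come from the `head(2)` output above, and the remaining values are invented for illustration.

```python
import pandas as pd

# Hypothetical data; not the real FOIL file.
foil = pd.DataFrame({
    "dispatching_base_number": ["B02512", "B02512", "B02512",
                                "B02765", "B02765", "B02765"],
    "active_vehicles": [190, 200, 210, 225, 245, 265],
})

# Summary statistics per base, similar to what a box plot displays.
stats = (
    foil.groupby("dispatching_base_number")["active_vehicles"]
        .agg(["min", "median", "max"])
)
ranking = stats.sort_values("median", ascending=False)

print(ranking)
```

Ranking by median rather than maximum is a design choice here; a single unusually busy day affects the maximum far more than the median.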
# A violin plot shows the shape of the distribution in addition to the five-number summary
px.violin(x = 'dispatching_base_number', y = 'active_vehicles', data_frame = uber_foil)
Collecting and merging the raw data files for all months¶
files = os.listdir(r"D:\Datasets")[-8:]
files
['uber-raw-data-apr14.csv', 'uber-raw-data-aug14.csv', 'uber-raw-data-janjune-15.csv', 'uber-raw-data-janjune-15_sample.csv', 'uber-raw-data-jul14.csv', 'uber-raw-data-jun14.csv', 'uber-raw-data-may14.csv', 'uber-raw-data-sep14.csv']
files.remove('uber-raw-data-janjune-15.csv')
files.remove('uber-raw-data-janjune-15_sample.csv')
final = pd.DataFrame()
path = r'D:\Datasets'
for file in files:
    # Read one monthly file and stack it on top of the rows collected so far
    current_df = pd.read_csv(os.path.join(path, file))
    final = pd.concat([current_df, final])
final.shape
(4534327, 4)
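Selecting files with `os.listdir(...)[-8:]` depends on how many files happen to be in the folder and how they sort. A hedged alternative is to match the monthly 2014 files by name. The helper `load_monthly_files` and its default pattern are assumptions made for this sketch, and the demo writes tiny stand-in CSVs to a temporary folder rather than reading the real data.

```python
import glob
import os
import tempfile

import pandas as pd


def load_monthly_files(folder, pattern="uber-raw-data-*14.csv"):
    """Read every CSV in `folder` whose name matches `pattern` into one frame."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    frames = [pd.read_csv(p) for p in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Demo with stand-in files; the 2015 file must not be picked up.
with tempfile.TemporaryDirectory() as folder:
    for name in ["uber-raw-data-apr14.csv",
                 "uber-raw-data-may14.csv",
                 "uber-raw-data-janjune-15.csv"]:
        demo = pd.DataFrame({"Base": ["B02512", "B02598"]})
        demo.to_csv(os.path.join(folder, name), index=False)
    combined = load_monthly_files(folder)

print(combined.shape)
```

Unlike the loop above, this version appends files in sorted order and resets the index, so row order and index values differ from `final`. `pd.concat` raises a `ValueError` if no file matches the pattern.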
final.duplicated().sum()
np.int64(82581)
final.drop_duplicates(inplace = True)
final.shape
(4451746, 4)
final.head(3)
| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| 0 | 9/1/2014 0:01:00 | 40.2201 | -74.0021 | B02512 |
| 1 | 9/1/2014 0:01:00 | 40.7500 | -74.0027 | B02512 |
| 2 | 9/1/2014 0:03:00 | 40.7559 | -73.9864 | B02512 |
rush_uber = final.groupby(['Lat', 'Lon'],as_index = False).size()
rush_uber
| | Lat | Lon | size |
|---|---|---|---|
| 0 | 39.6569 | -74.2258 | 1 |
| 1 | 39.6686 | -74.1607 | 1 |
| 2 | 39.7214 | -74.2446 | 1 |
| 3 | 39.8416 | -74.1512 | 1 |
| 4 | 39.9055 | -74.0791 | 1 |
| ... | ... | ... | ... |
| 574553 | 41.3730 | -72.9237 | 1 |
| 574554 | 41.3737 | -73.7988 | 1 |
| 574555 | 41.5016 | -72.8987 | 1 |
| 574556 | 41.5276 | -72.7734 | 1 |
| 574557 | 42.1166 | -72.0666 | 1 |
574558 rows × 3 columns
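The grouped table is sorted by coordinates, so the rows shown all have a count of 1 and the busiest locations are hidden. A minimal sketch of surfacing the top locations with `nlargest`, on a hypothetical handful of pickups (the coordinates reuse values from the `head(3)` output above; the repetition is invented):

```python
import pandas as pd

# Hypothetical pickups; one location deliberately repeated three times.
pickups = pd.DataFrame({
    "Lat": [40.7500, 40.7500, 40.7559, 40.2201, 40.7500],
    "Lon": [-74.0027, -74.0027, -73.9864, -74.0021, -74.0027],
})

# Same grouping as above: one row per (Lat, Lon) pair with its pickup count.
rush = pickups.groupby(["Lat", "Lon"], as_index=False).size()

# Keep the two locations with the highest counts.
top = rush.nlargest(2, "size")

print(top)
```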
!pip install folium
import folium
basemap = folium.Map()
basemap
from folium.plugins import HeatMap
HeatMap(rush_uber).add_to(basemap)
<folium.plugins.heat_map.HeatMap at 0x13fa70c0ec0>
basemap
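`folium.Map()` with no arguments opens on a world view, so the heat map has to be zoomed by hand. One option, sketched below under assumptions, is to compute a center from the data and pass the rows as a plain list of `[lat, lon, weight]`. The frame is a hypothetical stand-in for `rush_uber`, the zoom level is a guess, and the folium calls are left as comments so the block runs without folium installed.

```python
import pandas as pd

# Hypothetical stand-in for rush_uber (columns: Lat, Lon, size).
rush = pd.DataFrame({
    "Lat": [40.7500, 40.7559, 40.2201],
    "Lon": [-74.0027, -73.9864, -74.0021],
    "size": [3, 1, 1],
})

# Mean coordinates as a simple map center.
center = [rush["Lat"].mean(), rush["Lon"].mean()]

# Each row becomes [lat, lon, weight], the layout HeatMap accepts.
heat_rows = rush[["Lat", "Lon", "size"]].values.tolist()

print(center)
print(heat_rows[:2])

# With folium installed, these could then be used as:
#   basemap = folium.Map(location=center, zoom_start=10)
#   HeatMap(heat_rows).add_to(basemap)
```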